Reducing Reparameterization Gradient Variance

Andrew Miller, Nick Foti, Alexander D'Amour, Ryan P. Adams

Neural Information Processing Systems

Optimization with noisy gradients has become ubiquitous in statistics and machine learning. Reparameterization gradients, or gradient estimates computed via the "reparameterization trick," represent a class of noisy gradients often used in Monte Carlo variational inference (MCVI). However, when these gradient estimators are too noisy, the optimization procedure can be slow or fail to converge. One way to reduce noise is to generate more samples for the gradient estimate, but this can be computationally expensive. Instead, we view the noisy gradient as a random variable and form an inexpensive approximation of the generating procedure for the gradient sample. This approximation has high correlation with the noisy gradient by construction, making it a useful control variate for variance reduction. We demonstrate our approach on a non-conjugate hierarchical model and a Bayesian neural net, where our method attained orders-of-magnitude (20- to 2,000-fold) reductions in gradient variance, resulting in faster and more stable optimization.
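The control-variate idea in this abstract can be illustrated with a minimal, self-contained sketch. The toy estimator and its cheap linear approximation below are illustrative choices for demonstration, not the paper's actual model or approximation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples g of some expectation, plus a cheap approximation
# g_tilde that is highly correlated with g and whose mean is known
# in closed form (here E[g_tilde] = 1.0 exactly).
z = rng.normal(size=100_000)
g = np.exp(z)        # noisy samples; E[g] = e^{1/2}
g_tilde = 1.0 + z    # cheap linear approximation; known mean 1.0

# Optimal control-variate coefficient c* = Cov(g, g_tilde) / Var(g_tilde).
c = np.cov(g, g_tilde)[0, 1] / np.var(g_tilde)

# Same expectation as g, but lower variance by construction.
g_cv = g - c * (g_tilde - 1.0)

print(np.var(g) / np.var(g_cv))  # variance reduction factor (> 1)
```

The correction term has known mean zero, so the estimator stays unbiased; the stronger the correlation between the noisy sample and its approximation, the larger the variance reduction.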



NEON2: Finding Local Minima via First-Order Oracles

Zeyuan Allen-Zhu, Yuanzhi Li

Neural Information Processing Systems

We propose a reduction for non-convex optimization that can (1) turn a stationary-point-finding algorithm into a local-minimum-finding one, and (2) replace Hessian-vector product computations with only gradient computations. It works in both the stochastic and the deterministic settings, without hurting the algorithm's performance. As applications, our reduction turns Natasha2 into a first-order method without hurting its theoretical performance. It also converts SGD, GD, SCSG, and SVRG into algorithms that find approximate local minima, outperforming some of the best known results.
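The second part of the reduction, replacing Hessian-vector products with gradient computations, can be sketched via a finite difference of gradients, since H(x)v ≈ (∇f(x + εv) − ∇f(x)) / ε. The quadratic test function below is illustrative, not from the paper:

```python
import numpy as np

def hvp_via_gradients(grad, x, v, eps=1e-5):
    """Approximate the Hessian-vector product H(x) @ v using only two
    gradient evaluations: H(x) v ≈ (grad(x + eps*v) - grad(x)) / eps.
    This is the kind of gradient-only oracle such reductions rely on."""
    return (grad(x + eps * v) - grad(x)) / eps

# Toy quadratic f(x) = 0.5 x^T A x, so grad f(x) = A x and H = A exactly,
# which makes the finite difference exact up to floating-point error.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
grad = lambda x: A @ x
x = np.array([0.5, -1.0])
v = np.array([1.0, 2.0])

print(hvp_via_gradients(grad, x, v))  # ≈ A @ v = [4. 7.]
```

For non-quadratic objectives the approximation incurs O(ε) error, which is the usual trade-off against the exact (but more expensive) Hessian-vector oracle.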





A Algorithms

Neural Information Processing Systems

Below we include detailed pseudocode for the algorithms described in the main text. Algorithm 2 (Parameter-Free DeltaShift) takes as input implicit matrix-vector multiplication access to A. In this section, we give a full proof of Theorem 1.1 with the correct logarithmic dependence. Before doing so, we collect several definitions and results required for proving the theorem. As discussed, a tight analysis of Hutchinson's estimator, and also of our DeltaShift algorithm, relies on these results. Finally, from Claim B.2, we immediately have the bound for Rademacher random vectors; a similar analysis can be performed for any i.i.d. distribution. Now we are ready to move on to the main result. The proof is by induction: we claim the bound holds for all j = 1, ..., m, and then consider the inductive case.
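For context, Hutchinson's estimator referenced above estimates tr(A) using only implicit matrix-vector products with A. A minimal sketch with Rademacher vectors follows; the sample count and the toy matrix are illustrative, and this is the classical estimator rather than the paper's DeltaShift algorithm:

```python
import numpy as np

def hutchinson_trace(matvec, n, num_samples=1000, rng=None):
    """Hutchinson's estimator: tr(A) ~ (1/m) sum_i g_i^T A g_i for random
    g_i with E[g g^T] = I. Uses Rademacher (+/-1) vectors; only implicit
    matrix-vector multiplication access to A is required."""
    if rng is None:
        rng = np.random.default_rng()
    est = 0.0
    for _ in range(num_samples):
        g = rng.choice([-1.0, 1.0], size=n)  # Rademacher vector
        est += g @ matvec(g)
    return est / num_samples

# Toy check: a small explicit matrix with known trace 6.3.
A = np.diag([1.0, 2.0, 3.0]) + 0.1
print(hutchinson_trace(lambda v: A @ v, 3, rng=np.random.default_rng(0)))
```

With Rademacher vectors the diagonal contributions are recovered exactly, so the estimator's variance depends only on the off-diagonal mass of A, which is the property the tight analysis exploits.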


iMAML algorithm performs better than MAML

Neural Information Processing Systems

We thank the reviewers for the thoughtful feedback! Reviewer #1: Thank you for the thoughtful questions! We do not require convexity of L anywhere. Furthermore, regularity conditions are often needed for analysis but not to run the algorithm. Similarly, iMAML shows promising empirical results.


GPU-Accelerated Primal Learning for Extremely Fast Large-Scale Classification: Supplementary Material

Neural Information Processing Systems

Speedups were tested for both batch gradient descent (with a 0.001 learning rate) and L-BFGS. Let 1 denote the indicator function. TRON is detailed in Algorithm 1. The other direction is slightly different.
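As a hedged illustration of the batch gradient descent baseline mentioned above, here is a minimal sketch of batch gradient descent on an L2-regularized logistic loss (the primal classification objective; the function names, data, and hyperparameters are assumptions for illustration, not the paper's TRON or L-BFGS implementations):

```python
import numpy as np

def batch_gd_logreg(X, y, lr=0.001, steps=1000, reg=1.0):
    """Batch gradient descent on the L2-regularized logistic loss
        f(w) = (reg/2) ||w||^2 + sum_i log(1 + exp(-y_i x_i^T w)),
    with labels y_i in {-1, +1}. Every step uses the full batch."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        z = X @ w
        p = 1.0 / (1.0 + np.exp(-y * z))        # sigma(y_i x_i^T w)
        grad = reg * w - X.T @ (y * (1.0 - p))  # full-batch gradient
        w -= lr * grad
    return w

# Toy separable data: positive class at x > 0, negative at x < 0.
X = np.array([[1.0], [2.0], [-1.0], [-2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = batch_gd_logreg(X, y)
print(np.sign(X @ w))  # matches the labels y on this toy data
```

Each iteration touches the whole dataset, which is exactly the per-step cost that GPU acceleration of the primal solvers targets.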